A Statistical Approach to Automatic OCR Error Correction in Context
نویسندگان
چکیده
This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) postprocessing. The system exploits information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities. Letter n-grams are used to index the words in the lexicon. Given a sentence to be corrected, the system decomposes each string in the sentence into letter n-grams and retrieves word candidates from the lexicon by comparing string n-grams with lexicon-entry n-grams. The retrieved candidates are ranked by the conditional probability of matches with the string, given character confusion probabilities. Finally, the wordobigram model and Viterbi algorithm are used to determine the best scoring word sequence for the sentence. The system can correct non-word errors as well as real-word errors and achieves a 60.2% error reduction rate for real OCR text. In addition, the system can learn the character confusion probabilities for a specific OCR environment and use them in self-calibration to achieve better performance.
منابع مشابه
OCR Error Correction Using Statistical Machine Translation
In this paper, we explore the use of a statistical machine translation system for optical character recognition (OCR) error correction. We investigate the use of word and character-level models to support a translation from OCR system output to correct french text. Our experiments show that character and word based machine translation correction make significant improvements to the quality of t...
متن کاملDiploma Thesis: Unsupervised Post-Correction of OCR Errors
The trend to digitize (historic) paper-based archives has emerged in the last years. The advantages of digital archives are easy access, searchability and machine readability. These advantages can only be ensured if few or no OCR errors are present. These errors are the result of misrecognized characters during the OCR process. Large archives make it unreasonable to correct errors manually. The...
متن کاملEducational Context and ELT Teachers’ Corrective Feedback Preference: Public and Private School Teachers in Focus
This study investigated the possible relationship between educational context and English Language Teaching (ELT) teachers’ corrective feedback preference. To this end, 42 Iranian EEFL teachers from some private language institutes and 39 Iranian EFL teachers from different schools in Shiraz, Iran participated in the study. The Questionnaire for Corrective Feedback Approaches (QCFAs) was ...
متن کاملOCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set
Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were abundant and widely being published at that time; and hence, there was a need to convert them into digital format. OCR, short for Optical Character Recognition was conceived to translate paper-based books into digital e-books. Regret...
متن کاملLow-resource OCR error detection and correction in French Clinical Texts
In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. While traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training mater...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1996